An Evaluation of the Concept Retrieval Annotation for Spanish-English CLEFER Parallel Corpora

Authors

Rafael Berlanga Llavori

Antonio Jimeno-Yepes

María Pérez Catalán

Dietrich Rebholz-Schuhmann

Abstract

This paper studies the use of the concept retrieval annotation (CRA) method on parallel corpora. CRA treats concepts as documents and text chunks as queries [1]: the concepts most similar to a text chunk are considered when generating the final semantic annotation. CRA relies on an existing knowledge resource (KR), from which lexicons are extracted to perform the semantic annotation. Until now, CRA has been applied to monolingual scenarios, where it has performed well on both very large collections (e.g., CALBCII-SSC) and very large lexicons (e.g., UMLS® [2]). We have also applied this semantic annotator to several biomedical tasks, such as resource discovery [3], relation extraction [4], and scientific bibliography analysis [5]. In this work, we apply CRA in a bilingual scenario. For this purpose, we use the lexicons provided at the CLEFER workshop, specifically the English and Spanish lexicons. In this extended abstract, we first summarize the main features of CRA as a cross-lingual annotator, and then report the results obtained on the two provided parallel corpora, EMEA and MEDLINE®.
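
The idea of treating concepts as documents and text chunks as queries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the concept identifiers (`CONCEPT_1`, ...), the helper names `build_index` and `retrieve`, and the IDF-weighted overlap score, which stands in for whatever similarity measure CRA actually uses.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (not the CLEFER lexicons): concept id -> terms per language.
LEXICON = {
    "CONCEPT_1": {"en": ["heart attack", "myocardial infarction"],
                  "es": ["infarto de miocardio", "ataque al corazón"]},
    "CONCEPT_2": {"en": ["heart failure"],
                  "es": ["insuficiencia cardíaca"]},
    "CONCEPT_3": {"en": ["kidney failure", "renal failure"],
                  "es": ["insuficiencia renal"]},
}

def tokenize(text):
    # Naive whitespace tokenizer; a real annotator would normalize further.
    return text.lower().split()

def build_index(lexicon, lang):
    """Treat each concept as a 'document' made of the tokens of its terms."""
    docs = {cid: Counter(tok for term in terms[lang] for tok in tokenize(term))
            for cid, terms in lexicon.items()}
    n = len(docs)
    # Document frequency: in how many concepts each token appears.
    df = Counter(tok for bag in docs.values() for tok in bag)
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    return docs, idf

def retrieve(chunk, docs, idf, top_k=1):
    """Treat the text chunk as a 'query'; rank concepts by IDF-weighted overlap.

    The score is an assumed stand-in, not the similarity used by CRA.
    """
    query = set(tokenize(chunk))
    scores = {}
    for cid, bag in docs.items():
        shared = query & set(bag)
        if shared:
            # Normalize by the concept's own IDF vector length.
            norm = math.sqrt(sum(idf[t] ** 2 for t in bag))
            scores[cid] = sum(idf[t] for t in shared) / norm
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# One index per language; the concept ids are shared across languages.
en_docs, en_idf = build_index(LEXICON, "en")
es_docs, es_idf = build_index(LEXICON, "es")
print(retrieve("patient with heart failure", en_docs, en_idf))
print(retrieve("paciente con insuficiencia cardíaca", es_docs, es_idf))
```

Because both language-specific indexes share the same concept identifiers, annotations of aligned English and Spanish text units can be compared directly, which is what makes a parallel corpus a natural test bed for this kind of annotator.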

Similar resources

Unsupervised Disambiguation for a Multilingual Medical Information System using UMLS

This paper describes techniques for unsupervised word sense disambiguation of English and German medical documents using the Unified Medical Language System (UMLS). We present both monolingual techniques which rely only on the structure of UMLS, and bilingual techniques which also rely on the availability of parallel corpora. The best results are obtained using relationships between terms given...

The TDT-3 Text and Speech Corpus

The TDT-3 Text and Speech Corpus expands on previous phases of Topic Detection and Tracking data collections, by increasing the number of news sources being sampled, by including Mandarin Chinese as well as English news data, and by introducing new forms of topic annotation. In order to satisfy the specific data and annotation requirements of the TDT-3 Evaluation Plan[1], the LDC refined and su...

A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

OBJECTIVE To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified M...

Interlingual Annotation of Parallel Text Corpora: A New Framework for Annotation and Evaluation

This paper focuses on the next step in the creation of a system of meaning representation and the development of semantically-annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to provide parallel corpora annotated with detailed deep ...

Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System

We present a method for the automated acquisition of a multilingual medical lexicon (for Spanish and Swedish) to be used within the framework of a medical cross-language text retrieval system. We incorporate seed lexicons and parallel corpora derived from the UMLS Metathesaurus. The seed lexicons for Spanish and Swedish are automatically generated from (previously manually constructed) Portugue...

Publication date: 2013
